In this project the red wine data will be analysed. The main aim of the project is to understand which of the variables in the dataset impact the quality of the wine. This will be done by performing Exploratory Data Analysis (EDA) on the dataset: univariate, bivariate and multivariate analysis of the variables.
setwd("/udacity")
getwd()
## [1] "C:/udacity"
redWineData <- read.csv(file="c:/udacity/wineQualityReds.csv", header=TRUE, sep=",")
The data has been loaded into redWineData. We will run the str function on the dataset to view the variables present.
str(redWineData)
## 'data.frame': 1599 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
dim(redWineData)
## [1] 1599 13
2 - volatile acidity: the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of wine is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
Output variable (based on sensory data): 12 - quality (score between 0 and 10)
redWineData$rating <- ifelse(redWineData$quality < 5, 'bad', ifelse(
redWineData$quality < 7, 'average', 'good'))
redWineData$rating <- ordered(redWineData$rating,
levels = c('bad', 'average', 'good'))
redWineData$total_acidity <- redWineData$fixed.acidity + redWineData$volatile.acidity
str(redWineData)
## 'data.frame': 1599 obs. of 15 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
## $ rating : Ord.factor w/ 3 levels "bad"<"average"<..: 2 2 2 2 2 2 2 3 3 2 ...
## $ total_acidity : num 8.1 8.68 8.56 11.48 8.1 ...
summary(redWineData)
## X fixed.acidity volatile.acidity citric.acid
## Min. : 1.0 Min. : 4.60 Min. :0.1200 Min. :0.000
## 1st Qu.: 400.5 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090
## Median : 800.0 Median : 7.90 Median :0.5200 Median :0.260
## Mean : 800.0 Mean : 8.32 Mean :0.5278 Mean :0.271
## 3rd Qu.:1199.5 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420
## Max. :1599.0 Max. :15.90 Max. :1.5800 Max. :1.000
## residual.sugar chlorides free.sulfur.dioxide
## Min. : 0.900 Min. :0.01200 Min. : 1.00
## 1st Qu.: 1.900 1st Qu.:0.07000 1st Qu.: 7.00
## Median : 2.200 Median :0.07900 Median :14.00
## Mean : 2.539 Mean :0.08747 Mean :15.87
## 3rd Qu.: 2.600 3rd Qu.:0.09000 3rd Qu.:21.00
## Max. :15.500 Max. :0.61100 Max. :72.00
## total.sulfur.dioxide density pH sulphates
## Min. : 6.00 Min. :0.9901 Min. :2.740 Min. :0.3300
## 1st Qu.: 22.00 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500
## Median : 38.00 Median :0.9968 Median :3.310 Median :0.6200
## Mean : 46.47 Mean :0.9967 Mean :3.311 Mean :0.6581
## 3rd Qu.: 62.00 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300
## Max. :289.00 Max. :1.0037 Max. :4.010 Max. :2.0000
## alcohol quality rating total_acidity
## Min. : 8.40 Min. :3.000 bad : 63 Min. : 5.120
## 1st Qu.: 9.50 1st Qu.:5.000 average:1319 1st Qu.: 7.680
## Median :10.20 Median :6.000 good : 217 Median : 8.445
## Mean :10.42 Mean :5.636 Mean : 8.847
## 3rd Qu.:11.10 3rd Qu.:6.000 3rd Qu.: 9.740
## Max. :14.90 Max. :8.000 Max. :16.285
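The rating variable created above with nested ifelse calls can also be built in one step with cut. The following is only a sketch on a toy vector of quality scores (the values are hypothetical, not the wine data); the break points are chosen to reproduce the same grouping (below 5 is bad, 5 to 6 is average, 7 and above is good).

```r
# Toy quality scores (hypothetical values, not taken from the wine data)
quality <- c(3, 4, 5, 6, 7, 8)

# Intervals are right-closed by default: (-Inf,4], (4,6], (6,Inf]
rating <- cut(quality,
              breaks = c(-Inf, 4, 6, Inf),
              labels = c("bad", "average", "good"),
              ordered_result = TRUE)

print(data.frame(quality, rating))
```

Because quality is an integer score, the breaks at 4 and 6 give the same result as the conditions quality < 5 and quality < 7.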
The individual variables will be analysed before finding their impact on the quality of wine. This will help us understand the nature of each variable.
Import all the required libraries
library("ggplot2")
library("dplyr")
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library("gridExtra")
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
#library(Simpsons)
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'memisc'
## The following objects are masked from 'package:dplyr':
##
## collect, recode, rename
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
library(pander)
##
## Attaching package: 'pander'
## The following object is masked from 'package:GGally':
##
## wrap
library(corrplot)
## corrplot 0.84 loaded
library(MASS)
library(Hmisc)
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:memisc':
##
## %nin%, html
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
library(pastecs)
##
## Attaching package: 'pastecs'
## The following objects are masked from 'package:dplyr':
##
## first, last
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(easyGgplot2)
Here is the plot for quality:
ggplot(data=redWineData, aes(x=quality)) + geom_bar(width = 1, color = 'Brown', fill ='sky blue')
describe(redWineData$quality)
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 5.64 0.81 6 5.59 1.48 3 8 5 0.22 0.29
## se
## X1 0.02
stat.desc(redWineData$quality)
## nbr.val nbr.null nbr.na min max
## 1.599000e+03 0.000000e+00 0.000000e+00 3.000000e+00 8.000000e+00
## range sum median mean SE.mean
## 5.000000e+00 9.012000e+03 6.000000e+00 5.636023e+00 2.019555e-02
## CI.mean.0.95 var std.dev coef.var
## 3.961255e-02 6.521684e-01 8.075694e-01 1.432871e-01
#function to plot graphs without x and y limits
plot_no_lim <- function (var,var_name)
{
title <- paste(var_name," distribution")
grid.arrange(ggplot(redWineData, aes( x = 1, y = var ) ) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ) +
scale_y_continuous() + labs(y = var_name) + ggtitle(title),
ggplot(data = redWineData, aes(x = var)) +
geom_histogram(binwidth = 1, color = 'black',fill = I('orange')) +
scale_x_continuous() + labs(x = var_name) + ggtitle(title),ncol = 2)
describe(var)
}
#function to plot graphs with x and y limits set to understand the distribution of data
plot_with_lim <- function (var,l1,l2,var_name,b1)
{
title <- paste(var_name," distribution")
grid.arrange(ggplot(redWineData, aes( x = 1, y = var ) ) +
geom_jitter(alpha = 0.1 ) +
geom_boxplot(alpha = 0.2, color = 'red' ) +
scale_y_continuous(lim = c(l1,l2)) + labs(y = var_name) +ggtitle(title),
ggplot(data = redWineData, aes(x = var)) +
geom_histogram(binwidth = b1, color = 'black',fill = I('orange')) +
scale_x_continuous(lim = c(l1,l2)) + labs(x = var_name) + ggtitle(title),ncol = 2)
}
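The helper plot_no_lim uses binwidth = 1 for every variable, which is too coarse for variables with a narrow range such as density (0.9901 to 1.0037). One possible way to pick a bin width per variable is the Freedman-Diaconis rule. The sketch below is an assumption about how this could be done, not part of the original analysis; fd_binwidth and fd_from_quartiles are hypothetical helper names.

```r
# Freedman-Diaconis rule: bin width = 2 * IQR / n^(1/3)
fd_binwidth <- function(x) {
  x <- x[is.finite(x)]            # drop NA / Inf before computing
  2 * IQR(x) / length(x)^(1 / 3)
}

# Same rule, computed from quartiles already reported by summary()
fd_from_quartiles <- function(q1, q3, n) {
  2 * (q3 - q1) / n^(1 / 3)
}

# Small toy vector to show the vector version
print(fd_binwidth(1:8))

# Quartiles taken from the summary() output shown earlier (n = 1599)
print(fd_from_quartiles(7.10, 9.20, 1599))      # fixed.acidity
print(fd_from_quartiles(0.9956, 0.9978, 1599))  # density
```

The result could be passed as the binwidth argument of geom_histogram instead of a fixed value.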
#plot the graphs for each variable of interest to understand its distribution
plot_no_lim(redWineData$fixed.acidity,"fixed.acidity")
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 8.32 1.74 7.9 8.15 1.48 4.6 15.9 11.3 0.98 1.12
## se
## X1 0.04
#setting the limits for the x and y axes to 4 and 14, as most of the data lies between these values
plot_with_lim(redWineData$fixed.acidity,4,14,"fixed.acidity",1)
## Warning: Removed 8 rows containing non-finite values (stat_boxplot).
## Warning: Removed 9 rows containing missing values (geom_point).
## Warning: Removed 8 rows containing non-finite values (stat_bin).
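The warnings above appear because values outside the scale limits are dropped before plotting. The number of dropped rows can be checked directly; the sketch below uses a toy vector (hypothetical values) and a hypothetical helper count_outside. The jittered layer probably reports one extra row because jittering can push a point that was just inside a limit to just outside it.

```r
# Count values that fall outside the closed interval [lower, upper]
count_outside <- function(x, lower, upper) {
  sum(x < lower | x > upper, na.rm = TRUE)
}

# Toy values (hypothetical): 3.5, 14.2 and 15.9 lie outside [4, 14]
x <- c(3.5, 4.6, 7.9, 13.9, 14.2, 15.9)
print(count_outside(x, 4, 14))
```

Applied to a column of the wine data with the same limits used in plot_with_lim, this count should match the "non-finite values" warnings.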
#plotting for volatile acidity
plot_no_lim(redWineData$volatile.acidity,"volatile.acidity")
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 0.53 0.18 0.52 0.52 0.18 0.12 1.58 1.46 0.67 1.21
## se
## X1 0
#setting the limits for the x and y axes to 0 and 1, as most of the data lies between these values
plot_with_lim(redWineData$volatile.acidity,0,1,"volatile.acidity",0.25)
## Warning: Removed 21 rows containing non-finite values (stat_boxplot).
## Warning: Removed 23 rows containing missing values (geom_point).
## Warning: Removed 21 rows containing non-finite values (stat_bin).
#plotting for ph
plot_no_lim(redWineData$pH,"pH")
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 3.31 0.15 3.31 3.31 0.15 2.74 4.01 1.27 0.19 0.8
## se
## X1 0
#setting the limits for x and y axis
plot_with_lim(redWineData$pH,0,6,"pH",0.2)
#plotting for citric.acid
plot_no_lim(redWineData$citric.acid,"citric.acid")
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1599 0.27 0.19 0.26 0.26 0.25 0 1 1 0.32 -0.79 0
#setting the limits for x and y axis
plot_with_lim(redWineData$citric.acid,-1,2,"citric.acid",0.08)
#plotting for residual.sugar
plot_no_lim(redWineData$residual.sugar,"residual.sugar")
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 2.54 1.41 2.2 2.26 0.44 0.9 15.5 14.6 4.53 28.49
## se
## X1 0.04
#setting the limits for x and y axis
plot_with_lim(redWineData$residual.sugar,1,8,"residual.sugar",0.1)
## Warning: Removed 23 rows containing non-finite values (stat_boxplot).
## Warning: Removed 23 rows containing missing values (geom_point).
## Warning: Removed 23 rows containing non-finite values (stat_bin).
#plotting for chlorides
plot_no_lim(redWineData$chlorides,"chlorides")
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 0.09 0.05 0.08 0.08 0.01 0.01 0.61 0.6 5.67 41.53
## se
## X1 0
#setting the limits for x and y axis
plot_with_lim(redWineData$chlorides,0,0.5,"chlorides",0.02)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing non-finite values (stat_bin).
#plotting for free.sulfur.dioxide
plot_no_lim(redWineData$free.sulfur.dioxide,"free.sulfur.dioxide")
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 15.87 10.46 14 14.58 10.38 1 72 71 1.25 2.01
## se
## X1 0.26
#setting the limits for x and y axis
plot_with_lim(redWineData$free.sulfur.dioxide,0,45,"free.sulfur.dioxide",1)
## Warning: Removed 24 rows containing non-finite values (stat_boxplot).
## Warning: Removed 26 rows containing missing values (geom_point).
## Warning: Removed 24 rows containing non-finite values (stat_bin).
#plotting for total.sulfur.dioxide
plot_no_lim(redWineData$total.sulfur.dioxide,"total.sulfur.dioxide")
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 46.47 32.9 38 41.84 26.69 6 289 283 1.51 3.79
## se
## X1 0.82
#setting the limits for x and y axis
plot_with_lim(redWineData$total.sulfur.dioxide,0,180,"total.sulfur.dioxide",5)
## Warning: Removed 2 rows containing non-finite values (stat_boxplot).
## Warning: Removed 2 rows containing missing values (geom_point).
## Warning: Removed 2 rows containing non-finite values (stat_bin).
#plotting for sulphates
plot_no_lim(redWineData$sulphates,"sulphates")
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1599 0.66 0.17 0.62 0.64 0.12 0.33 2 1.67 2.42 11.66 0
#setting the limits for x and y axis
plot_with_lim(redWineData$sulphates,0,2,"sulphates",0.1)
#plotting for density
plot_no_lim(redWineData$density,"density")
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 1599 1 0 1 1 0 0.99 1 0.01 0.07 0.92 0
#setting the limits for x and y axis
plot_with_lim(redWineData$density,0.5,1.5,"density",0.001)
#plotting for alcohol
plot_no_lim(redWineData$alcohol,"alcohol")
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 10.42 1.07 10.2 10.31 1.04 8.4 14.9 6.5 0.86 0.19
## se
## X1 0.03
#setting the limits for x and y axis
plot_with_lim(redWineData$alcohol,8,14,"alcohol",0.1)
## Warning: Removed 1 rows containing non-finite values (stat_boxplot).
## Warning: Removed 7 rows containing missing values (geom_point).
## Warning: Removed 1 rows containing non-finite values (stat_bin).
#plotting for total_acidity
plot_no_lim(redWineData$total_acidity,"total_acidity")
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 1599 8.85 1.7 8.45 8.69 1.42 5.12 16.29 11.17 0.97 1.23
## se
## X1 0.04
#setting the limits for x and y axis
plot_with_lim(redWineData$total_acidity,4,14,"total_acidity",0.1)
## Warning: Removed 13 rows containing non-finite values (stat_boxplot).
## Warning: Removed 13 rows containing missing values (geom_point).
## Warning: Removed 13 rows containing non-finite values (stat_bin).
The dataset has 13 variables (including the row index X) and 1599 observations. The variables in the dataset are
fixed.acidity
volatile.acidity
citric.acid
residual.sugar
chlorides
free.sulfur.dioxide
total.sulfur.dioxide
density
pH
sulphates
alcohol
quality
The variable of interest is quality. We want to study the variables that have an impact on the quality of wine.
The expectation is that citric acid, pH, residual sugar, alcohol and total acidity will help explain the quality of wine. These factors contribute to the taste of wine and may therefore influence its quality.
Two new variables have been created. The first, total_acidity, is the sum of the volatile acidity and the fixed acidity, as these 2 variables together determine the acidity of the wine. The second, rating, categorises the wines into bad, average and good categories based on their quality score.
The x axis and y axis have been limited to give a closer view of the data. Plots with and without the outliers have been drawn to understand the distribution of the data.
We will first plot a scatterplot matrix to understand the relations between pairs of variables.
#ggpairs(redWineData, aes(colour = rating, alpha = 0.4))
#set.seed(666)
#ggpairs(redWineData[sample.int(nrow(redWineData),1000),])
#To increase readability we will plot 5 variables against quality first and then plot the remaining ones
ggpairs(redWineData,columns= c("fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","quality"), columnLabels = c("fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides","quality"),aes(colour = rating, alpha=0.4))
# plot graph for the remaining variables
ggpairs(redWineData,columns= c("free.sulfur.dioxide","total.sulfur.dioxide","pH","sulphates","alcohol","total_acidity","quality"), columnLabels = c("free.sulfur.dioxide","total.sulfur.dioxide","pH","sulphates","alcohol","total_acidity","quality"),aes(colour = rating, alpha=0.4))
## Observations from the scatterplot matrix
* No variable has a strong correlation with quality.
* Comparing the correlation coefficients of all variables with quality, the following show some relation with quality: alcohol (0.476), volatile acidity (-0.391), sulphates (0.251) and citric acid (0.226).
* Citric acid is strongly correlated with both fixed acidity and volatile acidity.
* Total sulfur dioxide is strongly correlated with free sulfur dioxide, and pH with total acidity.
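The correlation coefficients read off the scatterplot matrix can also be computed and ranked directly with cor. The following is a minimal sketch; cor_with_target is a hypothetical helper and the data frame holds made-up toy values, not the wine data.

```r
# Pearson correlation of every numeric column with a target column,
# sorted by absolute strength (strongest first)
cor_with_target <- function(df, target) {
  numeric_cols <- names(df)[sapply(df, is.numeric)]
  predictors <- setdiff(numeric_cols, target)
  r <- sapply(predictors, function(v) cor(df[[v]], df[[target]]))
  r[order(abs(r), decreasing = TRUE)]
}

# Toy data (hypothetical values) with the same column names as the wine data
toy <- data.frame(
  alcohol          = c(9.4, 9.8, 10.0, 10.5, 11.2, 12.0),
  volatile.acidity = c(0.70, 0.88, 0.65, 0.50, 0.40, 0.30),
  quality          = c(5, 5, 6, 6, 7, 7)
)

print(round(cor_with_target(toy, "quality"), 3))
```

On the real data the call would be cor_with_target(redWineData, "quality"); the row index X and quality itself should be ignored when reading the result.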
# function to plot different variables against quality
plot_relation_graph <- function(xvar,yvar,a1,xvar_name,yvar_name)
{
title <- paste(xvar_name, "Content vs", yvar_name, sep = " ")
ggplot(aes(x=xvar, y=yvar), data=redWineData) +
geom_jitter(alpha=a1) +
geom_smooth(method = "lm", se = FALSE) + labs(x = xvar_name, y= yvar_name) +
ggtitle(title)
}
#Plot relation between alcohol and quality
plot_relation_graph(redWineData$alcohol,redWineData$quality,0.66,"alcohol","quality")
#Plot relation between volatile acidity and quality
plot_relation_graph(redWineData$volatile.acidity,redWineData$quality,0.5,"volatile.acidity","quality")
#Plot relation between residual sugar and quality
plot_relation_graph(redWineData$residual.sugar,redWineData$quality,0.5,"residual.sugar","quality")
#Plot relation between pH and quality
plot_relation_graph(redWineData$pH,redWineData$quality,0.5,"pH","quality")
#Plot relation between sulphates and quality
plot_relation_graph(redWineData$sulphates,redWineData$quality,0.66,"sulphates","quality")
#Plot relation between citric acid and quality
plot_relation_graph(redWineData$citric.acid,redWineData$quality,0.5,"citric acid","quality")
#Plot relation between citric acid and volatile acidity
plot_relation_graph(redWineData$citric.acid,redWineData$volatile.acidity,0.2,"citric.acid","volatile.acidity")
#Plot relation between citric acid and fixed acidity
plot_relation_graph(redWineData$citric.acid,redWineData$fixed.acidity,0.2,"citric acid","fixed acidity")
#Plot relation between total sulphur dioxide and quality
plot_relation_graph(redWineData$total.sulfur.dioxide,redWineData$quality,0.2,"total sulphur dioxide","quality")
#Plot relation between free sulphur dioxide and quality
plot_relation_graph(redWineData$free.sulfur.dioxide,redWineData$quality,0.2,"free sulphur dioxide","quality")
#Plot relation between density and quality
plot_relation_graph(redWineData$density,redWineData$quality,0.2,"density","quality")
Now that a relation between the 4 variables and quality has been observed, we will draw box plots showing the distribution of each variable within each rating category.
plot_rel_rating <- function(yvar,yvar_name)
{
title <- paste0(yvar_name," vs Wine quality")
ggplot(redWineData, aes(x=rating, y=yvar,fill=rating)) +
geom_boxplot()+
xlab("wine category") + ylab(yvar_name) +
ggtitle(title)
}
plt1 <- plot_rel_rating(redWineData$alcohol,"Alcohol")
plt2 <- plot_rel_rating(redWineData$sulphates,"Sulphates")
plt3 <- plot_rel_rating(redWineData$volatile.acidity,"Volatile.acidity")
plt4 <- plot_rel_rating(redWineData$citric.acid,"Citric.acid")
grid.arrange(plt1,plt2,plt3,plt4)
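The box plots compare each variable across the rating categories visually. The same comparison can be made numerically by computing the median per category with aggregate. This is only a sketch on a toy data frame (hypothetical values); the column names mirror those in the wine data.

```r
# Toy data (hypothetical values), two wines per rating category
toy <- data.frame(
  rating = factor(c("bad", "bad", "average", "average", "good", "good"),
                  levels = c("bad", "average", "good"), ordered = TRUE),
  alcohol          = c(9.5, 10.1, 9.8, 10.4, 11.5, 12.1),
  volatile.acidity = c(0.90, 0.70, 0.55, 0.50, 0.40, 0.36)
)

# Median of each variable within each rating category
medians <- aggregate(cbind(alcohol, volatile.acidity) ~ rating,
                     data = toy, FUN = median)
print(medians)
```

The medians correspond to the centre line of each box in the plots.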
A strong relation has been observed between pH and total acidity. Strong relations have also been observed between citric acid and fixed acidity, and between citric acid and volatile acidity.
Relative to quality, alcohol had the strongest relation. Among the remaining variables, citric acid and fixed acidity have a strong relation with each other.
Now we will draw multi-variable plots to draw conclusions about the factors that impact wine quality.
We have seen that alcohol has the strongest relation with quality, so we will plot different variables against alcohol and quality to understand whether any of them together have an impact on the quality of wine.
multi_rel_plot <- function (yvar,xvar)
{
#colour by the three-level rating factor so that the three manual colours below are sufficient
ggplot(data = redWineData,
aes_string(y = yvar, x = xvar,
color = "rating")) +
geom_point(alpha = 0.8, size = 1) +
geom_smooth(method = "lm", se = FALSE,size=1) + labs(x = xvar,y= yvar) +
scale_color_manual(values=c("#999999", "#E69F00", "#56B4E9"), guide=guide_legend(title="Rating"))
}
multi_rel_plot("density","alcohol")
multi_rel_plot("sulphates","alcohol")
multi_rel_plot("residual.sugar","alcohol")
multi_rel_plot("pH","alcohol")
Now we will plot graphs with the acidity variables on the x axis
multi_rel_plot("citric.acid","fixed.acidity")
multi_rel_plot("residual.sugar","fixed.acidity")
multi_rel_plot("density","volatile.acidity")
### Observations
* Quality is high when volatile acidity and density are low.
* Quality increases with more alcohol and more sulphates.
* Wine has good quality when the alcohol content is high and volatile acidity is low.
* Density has the weakest correlation with quality.
* Residual sugar has no impact on quality.
plot_scatter <- function(xvar_name,yvar_name)
{
title <- paste(xvar_name," by ",yvar_name)
ggplot2.scatterplot(data=redWineData, xName=xvar_name, yName= yvar_name, size=3,
groupName="rating",
groupColors=c('#999999','#E69F00','#56B4E9'),
addRegLine=TRUE, fullrange=TRUE, setShapeByGroupName=TRUE,
backgroundColor="white",
xtitle=xvar_name, ytitle=yvar_name,
mainTitle=title,
faceting=TRUE, facetingVarNames="rating"
)
}
#plot showing the effect of volatile acidity and alcohol together
pl1 <- plot_scatter("alcohol","volatile.acidity")
#plot for alcohol and sulphates together with rating
pl2 <- plot_scatter("alcohol","sulphates")
#plot for alcohol and chlorides together with rating
pl3 <- plot_scatter("alcohol","chlorides")
#plot for alcohol and citric acid together with rating
pl4 <- plot_scatter("alcohol","citric.acid")
grid.arrange(pl1,pl2,pl3,pl4)
#plot volatile acidity against density by rating
plot_scatter("volatile.acidity","density")
#plot density and residual.sugar against alcohol to show that they have the weakest relation with the quality of wine
p1 <- plot_scatter("alcohol","density")
p2 <- plot_scatter("alcohol","residual.sugar")
grid.arrange(p1,p2)
We will now fit a linear model based on the data we have analysed so far:
#understanding the linear model
plt1 <- lm(quality ~ alcohol, data = redWineData)
summary(plt1)
##
## Call:
## lm(formula = quality ~ alcohol, data = redWineData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8442 -0.4112 -0.1690 0.5166 2.5888
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.87497 0.17471 10.73 <2e-16 ***
## alcohol 0.36084 0.01668 21.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7104 on 1597 degrees of freedom
## Multiple R-squared: 0.2267, Adjusted R-squared: 0.2263
## F-statistic: 468.3 on 1 and 1597 DF, p-value: < 2.2e-16
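To make the fitted coefficients concrete, the sketch below plugs the intercept and slope reported in the summary above into the regression equation and evaluates it at a few alcohol levels. It also checks that the square root of the reported R-squared agrees with the alcohol-quality correlation of about 0.476 seen in the scatterplot matrix, as expected for a one-predictor model. predict_quality is a hypothetical helper name.

```r
# Coefficients copied from the summary(plt1) output above
intercept <- 1.87497
slope     <- 0.36084

# Predicted quality for a given alcohol percentage
predict_quality <- function(alcohol) {
  intercept + slope * alcohol
}

print(predict_quality(c(9, 10, 12, 14)))

# For simple linear regression, sqrt(R-squared) equals |correlation|
print(sqrt(0.2267))
```

Each additional percentage point of alcohol raises the predicted quality by about 0.36 points, yet alcohol alone explains only about 23% of the variance in quality.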
m1 <- lm((quality ~ alcohol), data = redWineData)
m2 <- update(m1, ~ . + citric.acid)
m3 <- update(m2, ~ . + chlorides)
m4 <- update(m3, ~ . + residual.sugar)
m5 <- update(m4, ~ . + total_acidity)
m6 <- update(m5, ~ . + sulphates)
mtable(m1, m2, m3, m4, m5,m6)
##
## Calls:
## m1: lm(formula = (quality ~ alcohol), data = redWineData)
## m2: lm(formula = quality ~ alcohol + citric.acid, data = redWineData)
## m3: lm(formula = quality ~ alcohol + citric.acid + chlorides, data = redWineData)
## m4: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar,
## data = redWineData)
## m5: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar +
## total_acidity, data = redWineData)
## m6: lm(formula = quality ~ alcohol + citric.acid + chlorides + residual.sugar +
## total_acidity + sulphates, data = redWineData)
##
## ======================================================================================================
## m1 m2 m3 m4 m5 m6
## ------------------------------------------------------------------------------------------------------
## (Intercept) 1.875*** 1.830*** 2.056*** 2.085*** 2.000*** 1.719***
## (0.175) (0.171) (0.186) (0.187) (0.233) (0.228)
## alcohol 0.361*** 0.346*** 0.333*** 0.334*** 0.336*** 0.311***
## (0.017) (0.016) (0.017) (0.017) (0.017) (0.017)
## citric.acid 0.730*** 0.798*** 0.814*** 0.767*** 0.549***
## (0.090) (0.092) (0.093) (0.121) (0.120)
## chlorides -1.218** -1.200** -1.179** -2.564***
## (0.389) (0.390) (0.391) (0.408)
## residual.sugar -0.017 -0.017 -0.010
## (0.012) (0.012) (0.012)
## total_acidity 0.008 0.009
## (0.013) (0.013)
## sulphates 1.068***
## (0.113)
## ------------------------------------------------------------------------------------------------------
## R-squared 0.227 0.257 0.262 0.263 0.263 0.302
## adj. R-squared 0.226 0.256 0.261 0.261 0.261 0.300
## sigma 0.710 0.696 0.694 0.694 0.694 0.676
## F 468.267 276.595 188.675 142.024 113.650 114.880
## p 0.000 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -1721.057 -1688.711 -1683.819 -1682.921 -1682.733 -1639.019
## Deviance 805.870 773.917 769.196 768.333 768.153 727.280
## AIC 3448.114 3385.421 3377.637 3377.842 3379.467 3294.037
## BIC 3464.245 3406.930 3404.523 3410.105 3417.106 3337.054
## N 1599 1599 1599 1599 1599 1599
## ======================================================================================================
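In the table above, adding residual.sugar and total_acidity leaves R-squared essentially unchanged and slightly raises AIC, which suggests that these terms do not help. A nested-model F test with anova, together with AIC, is one way to check this formally. The sketch below runs on synthetic data (generated with an arbitrary seed, not the wine data) in which one predictor is pure noise by construction.

```r
# Synthetic data: quality depends on alcohol only; 'noise' is unrelated
set.seed(42)
n <- 200
toy <- data.frame(alcohol = runif(n, 8, 15), noise = rnorm(n))
toy$quality <- 2 + 0.35 * toy$alcohol + rnorm(n, sd = 0.7)

# Nested models: the larger one adds the unrelated predictor
m_small <- lm(quality ~ alcohol, data = toy)
m_big   <- update(m_small, ~ . + noise)

# F test for whether the extra term improves the fit
comparison <- anova(m_small, m_big)
print(comparison)

# Lower AIC indicates the preferred model
print(AIC(m_small, m_big))
```

On the wine data the equivalent calls would be anova(m3, m4) and AIC(m3, m4) for the residual.sugar term.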
A few of the relationships identified in the bivariate analysis, when combined, showed an impact on the quality of wine. The observations are:
* Quality is high when volatile acidity and density are low.
* Quality increases with more alcohol and more sulphates (the sulphates coefficient in the model above is positive).
* Wine has good quality when the alcohol content is high and volatile acidity is low.

A few variables showed no impact on quality when added alongside alcohol:
* Density has the weakest correlation with quality.
* Residual sugar has no impact on quality.

Earlier it was assumed that pH and citric acid would have a large impact on the quality of the wine. However, pH showed no notable relation with quality and citric acid only a weak one. Lower volatile acidity and higher sulphates, on the other hand, are associated with better quality wines.
ggplot(data=redWineData, aes(x=quality)) + geom_bar(width = 1, color = 'Brown', fill ='sky blue') +
ggtitle("Quality histogram")
##Plot 1 description
It can be noted that the dataset mostly contains average quality wines. There are very few observations for good and bad quality wines. This imbalance makes it difficult to determine the factors that impact the quality of wine.
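The imbalance can be quantified from the rating counts reported by summary(redWineData) earlier (63 bad, 1319 average, 217 good). The short sketch below converts these counts to percentages.

```r
# Counts per rating category, copied from the summary() output
rating_counts <- c(bad = 63, average = 1319, good = 217)

# Share of each category in percent, rounded to one decimal place
shares <- round(100 * rating_counts / sum(rating_counts), 1)
print(shares)
```

More than four out of five wines fall in the average category, so any pattern for bad or good wines rests on comparatively few observations.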
plot_rel_rating <- function(yvar,yvar_name)
{
title <- paste0(yvar_name," vs Wine quality")
ggplot(redWineData, aes(x=rating, y=yvar,fill=rating)) +
geom_boxplot()+
xlab("wine category") + ylab(yvar_name) +
ggtitle(title)
}
plt1 <- plot_rel_rating(redWineData$alcohol,"Alcohol")
plt2 <- plot_rel_rating(redWineData$sulphates,"Sulphates")
plt3 <- plot_rel_rating(redWineData$volatile.acidity,"Volatile.acidity")
plt4 <- plot_rel_rating(redWineData$citric.acid,"Citric.acid")
grid.arrange(plt1,plt2,plt3,plt4)
The above plots show that, among the variables examined, alcohol, sulphates, volatile.acidity and citric acid have the clearest relationship with quality.
plot_scatter <- function(xvar_name,yvar_name)
{
title <- paste(xvar_name," by ",yvar_name)
ggplot2.scatterplot(data=redWineData, xName=xvar_name, yName= yvar_name, size=3,
groupName="rating",
groupColors=c('#999999','#E69F00','#56B4E9'),
addRegLine=TRUE, fullrange=TRUE, setShapeByGroupName=TRUE,
backgroundColor="white",
xtitle=xvar_name, ytitle=yvar_name,
mainTitle=title,
faceting=TRUE, facetingVarNames="rating"
)
}
#plot showing the effect of volatile acidity and alcohol together
pl1 <- plot_scatter("alcohol","volatile.acidity")
#plot for alcohol and sulphates together with rating
pl2 <- plot_scatter("alcohol","sulphates")
#plot for alcohol and chlorides together with rating
pl3 <- plot_scatter("alcohol","chlorides")
#plot for alcohol and citric acid together with rating
pl4 <- plot_scatter("alcohol","citric.acid")
grid.arrange(pl1,pl2,pl3,pl4)
##Plot 3 description
Good quality wine is associated with low volatile acidity, higher sulphates and high alcohol content. Density and pH show no clear impact on the quality of wine.
Before beginning the bivariate and multivariate analysis, it was thought that pH and density would play a major role in the quality of wine. Only alcohol turned out to matter both before and after the analysis.
After the analysis, it was found that a high alcohol content together with higher sulphates and low volatile acidity is associated with good quality wines.
For future work, if a dataset with more good and bad rated wines is procured, the variables impacting the quality can be determined more reliably.